Abstract: In a lexicalized grammar formalism
such as Lexicalized Tree-Adjoining Grammar
(LTAG), each lexical item is associated with at 
least one elementary structure (supertag) that 
localizes syntactic and semantic dependencies. 
Thus a parser for a lexicalized grammar must 
search a large set of supertags to choose the right 
ones to combine for the parse of the sentence. We 
present techniques for disambiguating supertags 
using local information such as lexical preference 
and local lexical dependencies. The similarity 
between LTAG and Dependency grammars is ex- 
ploited in the dependency model of supertag dis- 
ambiguation. The performance results for vari- 
ous models of supertag disambiguation such as 
unigram, trigram and dependency-based models 
are presented. 

1 Introduction 

Part-of-speech disambiguation techniques (tag- 
gers) are often used to eliminate (or substan- 
tially reduce) the part-of-speech ambiguity prior 
to parsing. The taggers are all local in the sense 
that they use information from a limited context 
in deciding which tag(s) to choose for each word. 
As is well known, these taggers are quite success- 
ful. 

In a lexicalized grammar such as the Lexicalized
Tree-Adjoining Grammar (LTAG), each lexical item is
associated with at least one elementary structure (tree).
The elementary structures of LTAG localize dependencies,
including long distance dependencies, by requiring that
all and only the dependent elements be present within
the same structure. As a result of this localization,
a lexical item may be (and, in general, almost always is)
associated with more than one elementary structure. We
will call these elementary structures supertags, in order
to distinguish them from the standard part-of-speech
tags. Note that even when a word has a unique
standard part-of-speech, say a verb (V), there
will usually be more than one supertag associated
with this word. Since a complete parse contains
only one supertag for each word (assuming there is
no global ambiguity), an LTAG parser
(Schabes, 1988) needs to search a large space of
supertags to select the right one for each word
before combining them for the parse of a sentence.
It is this problem of supertag disambiguation
that we address in this paper.

Since LTAGs are lexicalized, we are presented 
with a novel opportunity to eliminate or substan- 
tially reduce the supertag assignment ambigu- 
ity by using local information such as local lex- 
ical dependencies, prior to parsing. As in stan- 
dard part-of-speech disambiguation, we can use 
local statistical information in the form of n-gram 
models based on the distribution of supertags in 
an LTAG-parsed corpus. Moreover, since the supertags
encode dependency information, we can
also use information about the distribution of dis- 
tances between a given supertag and its depen- 
dent supertags. 

Note that, as in standard part-of-speech disambiguation,
supertag disambiguation could have
been done by a parser. However, carrying out
part-of-speech disambiguation prior to parsing
makes the job of the parser much easier and
therefore speeds it up. Supertag disambiguation
as proposed in this paper reduces the work
of the parser even further. After supertag
disambiguation, we have effectively completed the
parse and the parser need 'only' combine the
individual structures; hence the term "almost parsing".
This method can also be used to parse sentence
fragments in cases where the supertag sequence
after the disambiguation may not combine into a
single structure.

The main goal of this paper is to present 
techniques for disambiguating supertags, and to 
evaluate their performance and their impact on 
LTAG parsing. Although presented with respect 
to LTAG, these techniques are applicable to lex- 
icalized grammars in general. Section 2 provides 
an introduction to Lexicalized Tree Adjoining 
Grammars. The objective of supertag disam- 
biguation is illustrated through an example in 
Section 3. Section 4 briefly describes the sys- 
tem used to collect the data needed for supertag 
disambiguation. Various methods and their per- 
formance results for supertag disambiguation are 
discussed in detail in Section 5. 

2 Lexicalized Tree Adjoining 
Grammars 

Lexicalized Tree Adjoining Grammar (LTAG) is
a lexicalized tree rewriting grammar formalism.
The primary structures of LTAG are Elementary
Trees. Each elementary tree has a lexical
item (anchor) on its frontier and provides an
extended domain of locality over which the anchor
specifies syntactic and semantic (predicate-argument)
constraints. Elementary trees are
of two kinds: Initial Trees and Auxiliary
Trees. Examples of initial trees (αs) and auxiliary
trees (βs) are shown in Figure 1. Nodes on
the frontier of initial trees are marked as substitution
sites by a '↓', while exactly one node on the
frontier of an auxiliary tree, whose label matches
the label of the root of the tree, is marked as a
foot node by a '*'. The other nodes on the frontier
of an auxiliary tree are marked as substitution
sites. LTAG factors recursion from the statement
of the syntactic dependencies. Elementary
trees (initial and auxiliary) are the domain for
specifying dependencies. Recursion is specified
via the auxiliary trees. Elementary trees are combined
by the Substitution and Adjunction operations.
Substitution inserts elementary trees at
the substitution nodes of other elementary trees.
Adjunction inserts auxiliary trees into elementary
trees at the node whose label is the same as
the root label of the auxiliary tree. As an example,
the component trees (α₈, α₂, α₃, α₄, β₈, α₅, α₆)
shown in Figure 1 can be combined to form
the sentence John saw a man with the telescope¹
as follows:

1. α₈ substitutes at the NP₀ node in α₂.

2. α₃ substitutes at the DetP node in α₄, the
result of which is substituted at the NP₁
node in α₂.

3. α₅ substitutes at the DetP node in α₆, the
result of which is substituted at the NP node
in β₈.

4. The result of step (3) above adjoins to the
VP node of the result of step (2). The resulting
parse tree is shown in Figure 2(a).

The process of combining the elementary trees 
resulting in the parse of the sentence is repre- 
sented by the derivation tree, shown in Fig- 
ure 2(b). The nodes of the derivation tree are 
the tree names that are anchored by the appro- 
priate lexical item. The composition operation 
is indicated by the nature of the arcs - dashed 
line for substitution and bold line for adjunction, 
while the address of the operation is indicated as 
part of the node label. The derivation tree can 
also be interpreted as a dependency graph with 
unlabeled arcs between words of the sentence as 
shown in Figure 2(c). 
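
The step from derivation tree to dependency graph can be sketched in a few lines of Python. This is only an illustration: the tuple encoding is hypothetical (it is not the XTAG representation), and the tree names are assumed to be those of Figure 1, read here as α8 for John, α2 for saw, α3 for a, α4 for man, β8 for with, α5 for the and α6 for telescope.

```python
# Anchor word of each elementary tree in the example.
# The tree names are an assumption based on Figure 1.
ANCHOR = {
    "α8": "John", "α2": "saw", "α3": "a", "α4": "man",
    "β8": "with", "α5": "the", "α6": "telescope",
}

# (dependent tree, head tree, operation): one triple per composition.
# This encoding is hypothetical and chosen only for illustration.
OPERATIONS = [
    ("α8", "α2", "substitution"),  # step 1
    ("α3", "α4", "substitution"),  # step 2, first half
    ("α4", "α2", "substitution"),  # step 2, second half
    ("α5", "α6", "substitution"),  # step 3, first half
    ("α6", "β8", "substitution"),  # step 3, second half
    ("β8", "α2", "adjunction"),    # step 4
]


def dependency_arcs(operations, anchor):
    """Drop the operation labels; return (dependent word, head word) pairs."""
    return [(anchor[dep], anchor[head]) for dep, head, _ in operations]


def root_word(operations, anchor):
    """The root of the derivation is the one tree that is never a dependent."""
    dependents = {dep for dep, _, _ in operations}
    roots = [tree for tree in anchor if tree not in dependents]
    assert len(roots) == 1, "a complete derivation has exactly one root"
    return anchor[roots[0]]


arcs = dependency_arcs(OPERATIONS, ANCHOR)
root = root_word(OPERATIONS, ANCHOR)
```

Each composition contributes exactly one unlabeled arc, so a sentence of seven words yields six arcs and a single root, the verb.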

We will call the elementary structures associated
with each lexical item super parts-of-speech
(super POS) or supertags.

3 Example of Supertagging 

As a result of localization in LTAG, a lexical item
may be associated with more than one supertag.
The example in Figure 3 illustrates the initial set
of supertags assigned to each word of the sentence
John saw a man with the telescope. The order
of the supertags for each lexical item in the
example is not significant. Figure 3 also shows the
final supertag sequence assigned by the supertagger,
which picks the best supertag sequence using
statistical information (described in Section 4)
about individual supertags and their dependencies
on other supertags. The chosen supertags
are combined to derive a parse, as explained in
Section 2.

Without the supertagger, the parser would have
to process combinations of the entire set of trees
(28); with it, the parser need only process
combinations of 7 trees.

1 The parse with the PP attached to the NP has not
been shown.

Figure 1: Elementary trees of LTAG
Figure 2: Structures of LTAG
Final Assignment: α₈ α₂ α₃ α₄ β₈ α₅ α₆
Figure 3: Supertag Assignment for John saw a man with the telescope

4 Data Collection 

The data required for disambiguating supertags 
(discussed in Section 5) have been collected by 
parsing the Wall Street Journal², IBM-manual
and ATIS corpora using the wide-coverage English
grammar being developed as part of the
XTAG system (Doran et al., 1994). The parses
generated by the system for these sentences from 
the corpora are not subjected to any kind of fil- 
tering or selection. All the derivation structures 
are used in the collection of the statistics. 

4.1 About XTAG 

XTAG is a large ongoing project to develop a 
wide-coverage grammar for English, based on the 
LTAG formalism. It also serves as an LTAG 
grammar development system and consists of a 
predictive left-to-right parser, an X-window in- 
terface, a morphological analyzer and a part-of- 
speech tagger. The wide-coverage English gram- 
mar of the XTAG system contains 317,000 in- 
flected items in the morphology (213,000 of these 
are nouns and 46,500 are verbs) and 37,000 en- 
tries in the syntactic lexicon. The syntactic lex- 
icon associates words with the trees that they 
anchor. There are 385 trees in all, in a grammar 
which is composed of 40 different subcategoriza- 
tion frames. Each word in the syntactic lexicon, 
on the average, depending on the standard parts- 
of-speech of the word, is an anchor for about 8 
to 40 elementary trees. 

5 Models, Experiments and 
Results 

The supertag statistics used in the preliminary
experiments described below have been collected
from the XTAG-parsed corpora. The derivation
structures resulting from the parsed corpora
(Wall Street Journal, for the experiments described
here) serve as training data for these experiments.

2 Sentences of length < 15 words

5.1 Unigram model 

One method of disambiguating the supertags as- 
signed to each word is to order the supertags by 
the lexical preference that the word has for them. 
The frequency with which a certain supertag is 
associated with a word is a direct measure of its 
lexical preference for that supertag. Associating 
frequencies with the supertags and using them 
to associate a particular supertag with a word 
is clearly the simplest means of disambiguating 
supertags. Thus, 

$$\mathrm{Supertag}(w_i) = t_k \;\ni\; \operatorname*{argmax}_{k}\, \mathrm{unigram}(t_k \mid w_i).$$

5.1.1 Experiments and Results 

Owing to sparseness of data, we have backed off
from word/supertag pairs to part-of-speech/supertag
pairs; that is, we collected the unigram
frequencies of supertags associated with the
part-of-speech assigned to each word rather than with
the word itself. Table 1 illustrates the nature of the
statistics used, with a few sample entries.



Part-of-speech    (supertag, unigram probability)
N                 (α₁, 0.218)
                  (α₈, 0.375)
                  (β₂, 0.282)
V                 (α₂, 0.099)
D                 (α₃, 0.963)

Table 1: Sample entries of unigram database



Top n Supertags    % Success
n = 1              15%
n = 2              22%
n = 3              52%

Table 2: Results from the Unigram Supertag Model

The words are first assigned standard parts-of-speech
using a conventional tagger (Church,
1988). Then the set of supertags associated with
each word is retrieved from XTAG's syntactic
database. These supertags are ordered based on
their unigram frequency, and the top n supertags
are associated with the word. Table 2 summarizes
the success percentage on a held-out test
set of 100 Wall Street Journal sentences, as n is
varied. If a sentence parses using the n supertags
selected for each word, then the assignment is
considered a success.
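
The procedure just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the XTAG implementation: the function names are hypothetical, the toy counts are invented, and probabilities are plain relative frequencies conditioned on the part-of-speech (the back-off described above).

```python
from collections import Counter, defaultdict


def train_unigram(pos_supertag_pairs):
    """Estimate Pr(supertag | POS) by relative frequency.

    pos_supertag_pairs: iterable of (pos, supertag) pairs read off
    parsed sentences. Returns {pos: {supertag: probability}}.
    """
    counts = defaultdict(Counter)
    for pos, supertag in pos_supertag_pairs:
        counts[pos][supertag] += 1
    return {
        pos: {tag: c / sum(tags.values()) for tag, c in tags.items()}
        for pos, tags in counts.items()
    }


def top_n_supertags(pos, candidates, model, n):
    """Rank a word's candidate supertags by unigram probability given
    its POS and keep the best n. Unseen supertags score 0."""
    probs = model.get(pos, {})
    ranked = sorted(candidates, key=lambda tag: probs.get(tag, 0.0),
                    reverse=True)
    return ranked[:n]


# Toy training data (invented counts, for illustration only).
corpus = [("N", "α8")] * 3 + [("N", "β2")] * 2 + [("N", "α1")] + [("D", "α3")]
model = train_unigram(corpus)
best_two = top_n_supertags("N", ["α1", "β2", "α8"], model, n=2)
```

Raising n trades parser work for coverage: more sentences retain the supertags they need, but the parser has more combinations to try.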

The unigram supertagger that selects the top three
supertags has been interfaced with XTAG. This
speeds up the runtime of the parser by 87% on
average, whenever the supertagger succeeds.

5.2 n-gram model 

In a unigram model a word is always associated 
with the supertag that is most preferred by the 
word, irrespective of the context in which the 
word appears. An alternate method that is sen- 
sitive to context is the n-gram model. The n- 
gram model takes into account the contextual de- 
pendency probabilities between supertags within 
a window of n words in associating supertags 
with words. Thus the most probable supertag
sequence for an N-word sentence is given by

$$\hat{T} = \operatorname*{argmax}_{T}\ \Pr(T_1, T_2, \ldots, T_N) \cdot \Pr(W_1, W_2, \ldots, W_N \mid T_1, T_2, \ldots, T_N)$$

To compute this using only local information,
we approximate, taking the probability of a word
to depend only on its supertag

$$\Pr(W_1, W_2, \ldots, W_N \mid T_1, T_2, \ldots, T_N) \approx \prod_{i=1}^{N} \Pr(W_i \mid T_i)$$

and also use an n-gram (trigram, in this case)
approximation

$$\Pr(T_1, T_2, \ldots, T_N) \approx \prod_{i=1}^{N} \Pr(T_i \mid T_{i-2}, T_{i-1})$$
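
Substituting both approximations into the argmax gives the quantity that a trigram supertagger actually maximizes. The derivation below only combines the formulas above; the treatment of the first two positions (padding symbols for the supertags before the sentence start) is an assumption, since it is not spelled out here.

```latex
\begin{align*}
\hat{T}
  &= \operatorname*{argmax}_{T}\ \Pr(T_1,\ldots,T_N)\cdot
     \Pr(W_1,\ldots,W_N \mid T_1,\ldots,T_N) \\
  &\approx \operatorname*{argmax}_{T_1,\ldots,T_N}\
     \prod_{i=1}^{N} \Pr(T_i \mid T_{i-2}, T_{i-1})\,\Pr(W_i \mid T_i) \\
  &= \operatorname*{argmax}_{T_1,\ldots,T_N}\
     \sum_{i=1}^{N} \Bigl[\log \Pr(T_i \mid T_{i-2}, T_{i-1})
       + \log \Pr(W_i \mid T_i)\Bigr]
\end{align*}
```

The last line holds because the logarithm is monotonic; working with sums of logs avoids numerical underflow on long sentences.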

5.2.1 Experiments and Results 

A trigram model has been used to model the 
contextual dependencies in supertag sequences. 
Again, due to sparseness of data, the particu- 
lar words have been ignored and the training of 
the trigram model has been done on the part-of- 
speech/supertag pair. The model has been tested 
on the same set of held out sentences as in the 
unigram experiment. The percentage success is 
68%, i.e., 68% of the words of the test corpus 
were assigned the correct supertag. 
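
The search procedure used with the trigram model is not named above. One standard choice is a Viterbi search whose state is the pair of the two most recent supertags; the sketch below assumes that choice. The padding symbol, the floor probability for unseen events, and the toy probability tables are all assumptions made for illustration.

```python
import math

START = "<s>"  # hypothetical padding symbol for positions before the sentence


def viterbi_trigram(tokens, candidates, trans, emit, floor=1e-12):
    """Return the supertag sequence maximizing
    sum_i [log Pr(T_i | T_{i-2}, T_{i-1}) + log Pr(W_i | T_i)].

    tokens:     list of tokens (POS tags, given the back-off in the text)
    candidates: {token: list of possible supertags}
    trans:      {(t_prev2, t_prev1, t): probability}
    emit:       {(token, t): probability}
    Unseen events get the probability `floor` (an assumed smoothing).
    """
    def logp(table, key):
        return math.log(table.get(key, floor))

    # best[(t_prev, t_cur)] = (log score, best sequence ending in that pair)
    best = {(START, START): (0.0, [])}
    for token in tokens:
        new_best = {}
        for (t2, t1), (score, seq) in best.items():
            for t in candidates[token]:
                s = score + logp(trans, (t2, t1, t)) + logp(emit, (token, t))
                key = (t1, t)
                if key not in new_best or s > new_best[key][0]:
                    new_best[key] = (s, seq + [t])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# Toy tables (invented numbers). Choosing the locally best first tag (α3)
# leads to a worse sequence than starting with α5.
candidates = {"D": ["α3", "α5"], "N": ["α1", "α8"]}
trans = {
    (START, START, "α3"): 0.6, (START, START, "α5"): 0.4,
    (START, "α3", "α1"): 0.5, (START, "α3", "α8"): 0.5,
    (START, "α5", "α1"): 0.9, (START, "α5", "α8"): 0.1,
}
emit = {("D", "α3"): 0.5, ("D", "α5"): 0.5, ("N", "α1"): 0.5, ("N", "α8"): 0.5}
best_sequence = viterbi_trigram(["D", "N"], candidates, trans, emit)
```

Keeping the last two supertags as the state is what lets the search score every trigram exactly once while still considering all sequences.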

5.3 Dependency model 

In the n-gram model for disambiguating supertags,
dependencies between supertags that
appear beyond the n-word window cannot be
incorporated into the model. This limitation can
be overcome if no a priori bound is set on the size
of the window but instead a probability distribution
of the distances of the dependent supertags
for each supertag is maintained. A supertag is
dependent on another supertag if the former
substitutes or adjoins into the latter³.

5.3.1 Experiments and Results 

Table 3 shows the data required for the depen- 
dency model of supertag disambiguation. Ide- 
ally each entry would be indexed by a (word, su- 
pertag) pair but, due to sparseness of data, we 
have backed-off to a (POS, supertag) pair. Each 
entry contains the following information. 

• POS and Supertag pair. 

• List of + and -, representing the direction of
the dependent supertags with respect to the 
indexed supertag. (Size of this list indicates 
the total number of dependent supertags re- 
quired.) 

• Dependent supertag. 

• Signed number representing the direction 
and the ordinal position of the particular 
dependent supertag mentioned in the entry 
from the position of the indexed supertag. 

• A probability of occurrence of such a depen- 
dency. The sum probability over all the de- 
pendent supertags at all ordinal positions in 
the same direction is one. 

For example, the fourth entry in Table 3
reads that the tree α₂, anchored by a verb (V),
has a left and a right dependent (-, +) and that the
first word to the left (-1), with the tree α₈, is
dependent on the current word. The strength of
this association is represented by the probability
0.300.

3 We are computing dependencies between words with
respect to supertags associated with the words, although
the complete structure of the supertags is not used. It is of
interest to compare our work with some other dependency-based
approaches as described by, for example, Sleator
(Sleator and Temperley, 1990), Hindle (Hindle, 1993),
Milward (Milward, 1992).

(P.O.S, Supertag)   Direction of           Dependent   Ordinal    Prob
                    Dependent Supertags    Supertag    position
(D, α₅)
(N, α₈)
(N, α₁)             (-)                    α₃          -1         0.999
(V, α₂)             (- +)                  α₈          -1         0.300
(V, α₂)             (- +)                  α₈          1          0.374

Table 3: Dependency Data

The dependency model of disambiguation
works as follows. Suppose α₂ is a member of the
set of supertags associated with a word at position
n in the sentence. The algorithm proceeds
to satisfy the dependency requirement of α₂ by
picking up the dependency entries for each of the
directions. It picks a dependency data entry (the
fourth entry, say) from the database that is
indexed by α₂ and proceeds to set up a path with
the first word to the left that has the dependent
supertag (α₈) as a member of its set of supertags.
If the first word to the left that has α₈ as a member
of its set of supertags is at position m, then an
arc is set up between α₂ and α₈. Also, the arc is
verified not to kite-string-tangle⁴ with any other
arcs in the path up to α₂. The path probability
up to α₂ is incremented by log 0.300 to reflect the
success of the match. The path probability up to
α₈ incorporates the unigram probability of α₈.
On the other hand, if no word is found that has
α₈ as a member of its set of supertags then the
entry is ignored. The algorithm makes a greedy
choice by selecting the path with the maximum
path probability to extend to the remaining
directions in the dependency list. A successful
supertag sequence is one which assigns a supertag
to each position such that each supertag has all
of its dependents and maximizes the accumulated
path probability. It is to be noted that the
algorithm, when pairing the head and its dependent,
is not really parsing, since it does so without
looking at the structure of the string between the
head and the dependent.
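
A single matching step of the procedure above can be sketched as follows. This is a simplified illustration under several assumptions: an entry is reduced to (dependent supertag, direction, probability), an arc that would tangle is simply skipped (the behaviour in that case is not specified above), the unigram contribution is left out, and the candidate sets are invented. The function names are hypothetical.

```python
import math


def first_word_with(candidates, head_pos, dep_tag, direction):
    """Scan from head_pos in `direction` (-1 = left, +1 = right) and return
    the position of the first word whose supertag set contains dep_tag,
    or None if there is no such word."""
    pos = head_pos + direction
    while 0 <= pos < len(candidates):
        if dep_tag in candidates[pos]:
            return pos
        pos += direction
    return None


def crosses(arc, other):
    """Kite-string-tangle test for two arcs given as position pairs:
    (a, c) and (b, d) tangle if a < b < c < d or b < a < d < c."""
    (a, c), (b, d) = sorted(arc), sorted(other)
    return a < b < c < d or b < a < d < c


def try_entry(candidates, head_pos, entry, arcs, log_prob):
    """Apply one dependency entry for the head at head_pos.
    Returns (arcs, log_prob), unchanged when the entry is ignored."""
    dep_tag, direction, prob = entry
    pos = first_word_with(candidates, head_pos, dep_tag, direction)
    if pos is None:
        return arcs, log_prob            # no matching word: entry ignored
    new_arc = (head_pos, pos)
    if any(crosses(new_arc, old) for old in arcs):
        return arcs, log_prob            # assumed: tangling arcs are skipped
    return arcs + [new_arc], log_prob + math.log(prob)


# Invented candidate sets for "John saw a man"; the head α2 is at position 1.
candidates = [{"α8"}, {"α2"}, {"α3"}, {"α4", "α8"}]
arcs, log_prob = [], 0.0
for entry in [("α8", -1, 0.300), ("α8", +1, 0.374)]:  # Table 3, rows 4 and 5
    arcs, log_prob = try_entry(candidates, 1, entry, arcs, log_prob)
```

Note that the step never inspects the words between head and dependent, which is the sense in which the procedure is not really parsing.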

The implementation and testing of this model
of supertag disambiguation is underway. Table 4
shows preliminary results on the same held-out
test set of 100 Wall Street Journal sentences
that was used for the unigram and trigram models.
The table shows two measures of evaluation.
In the first, the dependency link measure, the
test sentences were independently hand tagged
with dependency links and then were used to
match the links output by the dependency model.
The columns show the total number of dependency
links in the hand tagged set, the number
of matched links output by this model and the
percentage correctness. The second measure,
supertags, shows the total number of correct
supertags assigned to the words in the corpus by
this model.

4 Two arcs (a,c) and (b,d) kite-string-tangle if a < b <
c < d or b < a < d < c.



Criterion           Total number   Number correct   % correct
Dependency links    815            620              76.07%
Supertags           915            707              77.26%

Table 4: Results of Dependency model
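
The percentages in Table 4 follow directly from the two count columns. The short sketch below recomputes them and shows how matched links might be counted; representing a link as a (dependent position, head position) pair is an assumption, and the function names are hypothetical.

```python
def matched_links(gold_links, predicted_links):
    """Number of hand-tagged links that also appear in the model output.
    Links are (dependent position, head position) pairs."""
    return len(set(gold_links) & set(predicted_links))


def percent_correct(num_correct, total):
    """Percentage correct, as in the last column of Table 4."""
    return 100.0 * num_correct / total


links_pct = percent_correct(620, 815)      # dependency links row
supertags_pct = percent_correct(707, 915)  # supertags row
print(f"{links_pct:.2f} {supertags_pct:.2f}")
```

The first value agrees with the table. The second rounds to 77.27, so the 77.26% reported in the table may have been truncated rather than rounded.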



6 Conclusion 

Lexicalized grammars associate with each word 
richer structures (trees in case of LTAGs and cat- 
egories in case of Combinatory Categorial Gram- 
mars (CCGs)) over which the word specifies syn- 
tactic and semantic constraints. Hence every 
word is associated with a much larger set of 
more complex structures than in the case where 
the words are associated with standard parts- 
of-speech. However, these more complex de- 
scriptions allow more complex constraints to be 
imposed and verified locally on the contexts in 
which these words appear. This feature of lexicalized
grammars can be exploited to
further reduce the disambiguation task of the
parser, as shown in supertag disambiguation.
Hence supertag disambiguation can be used as 
a general pre-parsing component of lexicalized 
grammar parsers. 

The degree of distinction between supertag
disambiguation and parsing varies, depending on
the lexicalized grammar being considered. For
both LTAG and CCG, supertag disambiguation
serves as a pre-parser filter that effectively weeds
out inappropriate elementary structures (trees or
categories) given the context of the sentence. It
also indicates the dependencies among the
elementary structures but not the specific operation
to be used to combine the structures or the
address at which the operation is to be performed:
"an almost parse". In cases where the supertag
sequence for the given input string cannot be
combined to form a complete structure, the
"almost parse" may indeed be the best one can do.

In case of LTAG, even though no explicit 
substitutions or adjunctions are shown, the de- 
pendencies among LTAG trees uniquely iden- 
tify the combining operation between the trees 
and the node at which the operation can be 
performed is almost always unique⁵. Thus supertag
disambiguation is almost parsing for LTAGs.
In contrast, the dependencies among the
CCG categories do not result in directly identi- 
fying the combining operations between the cate- 
gories since two categories can often be combined 
in more than one way. Hence for CCG further 
processing needs to be performed to obtain the 
complete parse of the sentence, although without 
any supertag ambiguities. 

Supertag disambiguation, the dependency
model in particular, is even closer to parsing in
the dependency grammar formalism. Dependency
parsers establish relationships among words,
unlike phrase-structure parsers, which construct
a phrase-structure tree spanning the words of
the input. Since LTAGs are lexicalized and
each elementary tree is associated with at least
one lexical item, supertag disambiguation
for LTAG can be viewed as establishing
the relationship⁶ among words as dependency
parsers do. Then the elementary structures
that the related words anchor are combined
to reconstruct the phrase-structure tree similar 
to the result of phrase-structure parsers. Thus 
the interplay of both dependency and phrase- 
structure grammars can be seen in LTAGs. Ram- 
bow and Joshi (Rambow and Joshi, 1993) dis- 
cuss in greater detail the use of LTAG in relating 
dependency analyses to phrase-structure analy- 
ses and propose a dependency-based parser for a 
phrase-structure based grammar. 



5 In some cases, the dependency information between 
an auxiliary and an elementary tree may be insufficient to 
uniquely identify the address of adjunction, if the auxiliary 
tree can adjoin to more than one node in the elementary 
tree, since the specific attachments are not shown. 

6 The relational label between two words in LTAG is
associated with the address of the operation between the
trees that the words anchor.



In summary, we have presented a new tech- 
nique that performs the disambiguation of su- 
pertags using local information such as lexical 
preference and local lexical dependencies. This 
technique, like part-of-speech disambiguation, re- 
duces the disambiguation task that needs to be 
done by the parser. After the disambiguation, 
we have effectively completed the parse of the 
sentence and the parser needs 'only' to complete
the adjunctions and substitutions. This method
can also serve to parse sentence fragments in
cases where the supertag sequence after the
disambiguation may not combine to form a single
structure. We have implemented this technique
of disambiguation with n-gram models, using
probability data collected from an LTAG-parsed
corpus. The similarity between LTAG and
Dependency grammars is exploited in the
dependency model of supertag disambiguation. The
performance results of these models have been 
presented. 